
Add Cortex-M as a first-class target in aot_arm_compiler #17075

Open
psiddh wants to merge 2 commits into pytorch:main from psiddh:main

Conversation

@psiddh
Contributor

@psiddh psiddh commented Jan 30, 2026

Previously, Cortex-M op conversion was applied as an afterthought to all
non-vgf targets via transform_for_cortex_m_backend(). This made the flow
hard to follow, used a bare EdgeCompileConfig that decomposed ops like
linear into addmm (requiring unnecessary workarounds), and didn't use the
CortexMQuantizer or CortexMPassManager.

Add a dedicated to_edge_cortex_m() path, selected via --target=cortex-m, that
owns the full pipeline: CortexMQuantizer for INT8 quantization, a correct
EdgeCompileConfig with preserve_ops to prevent premature decomposition, and
CortexMPassManager.pass_list for op conversion. Remove the old scattered
transform_for_cortex_m_backend() function.
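For illustration, a minimal sketch of what such a dedicated path can look like, assuming the standard PT2E quantization flow and the executorch.exir to_edge API; the PR's actual implementation may differ:

    import torch
    from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
    from torch.export import export
    from executorch.exir import EdgeCompileConfig, to_edge

    def to_edge_cortex_m(model, example_inputs, quantizer, preserve_ops, passes):
        # 1. INT8 quantization (quantizer would be a CortexMQuantizer instance).
        graph_module = export(model, example_inputs).module()
        prepared = prepare_pt2e(graph_module, quantizer)
        prepared(*example_inputs)  # calibration run
        quantized = convert_pt2e(prepared)

        # 2. Lower to edge, preserving ops (e.g. aten.linear) so they are not
        #    prematurely decomposed into addmm.
        edge = to_edge(
            export(quantized, example_inputs),
            compile_config=EdgeCompileConfig(preserve_ops=preserve_ops),
        )

        # 3. Apply the Cortex-M op-conversion passes (instantiated pass objects).
        return edge.transform(passes)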

Verified all ops fully lowered to cortex_m::quantized_* operators for both
MobileNetV2 (70 nodes) and MobileNetV3 (122 nodes). E2E inference tested
on Alif E8 board.

Test Plan:
- python3 -m examples.arm.aot_arm_compiler -m mv2 --target=cortex-m --quantize --intermediates=./mv2_intermediates --output=./mv2_cortex_m.pte
- python3 -m examples.arm.aot_arm_compiler -m mv3 --target=cortex-m --quantize --intermediates=./mv3_intermediates --output=./mv3_cortex_m.pte

Also ran E2E inference on Alif E8 board

@pytorch-bot

pytorch-bot bot commented Jan 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17075

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 3 Cancelled Jobs, 2 Unrelated Failures

As of commit e6fd05b with merge base f06a1f6:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 30, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@psiddh psiddh force-pushed the main branch 6 times, most recently from 39666cd to 7f14a9d Compare February 4, 2026 09:06
@zingo zingo added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk module: microcontrollers For embedded MCUs like Cortex-M, or RTOS like Zephyr, does not track NPU backend like Arm Ethos. labels Feb 5, 2026
@psiddh psiddh force-pushed the main branch 5 times, most recently from 1b64ef3 to 41462be Compare February 6, 2026 07:48
)

# Cortex-m ops are never included in vgf or direct-drive
if args.target != "vgf" and not args.direct_drive:
@psiddh
Contributor Author
Should TOSA targets even have a CortexM fallback? (--target=u55/u85 → TOSA delegation)

@psiddh psiddh changed the title from "Summary:MV2 CortexM PassManager changes for Alif E8" to "Cortex-M: Enable full MobileNetV2 lowering to CMSIS-NN backend via Aot Compiler script" Feb 6, 2026
@psiddh psiddh marked this pull request as ready for review February 6, 2026 07:56
@psiddh psiddh requested a review from digantdesai as a code owner February 6, 2026 07:56
Copilot AI review requested due to automatic review settings February 6, 2026 07:56
Contributor

Copilot AI left a comment


Pull request overview

This PR enables full MobileNetV2 lowering to the CMSIS-NN backend for Cortex-M microcontrollers by implementing comprehensive support for quantized operations through a dedicated compilation path. The changes replace the previous delegation-based approach with a portable kernel-based architecture that converts all quantized operations to cortex_m::* operators.

Changes:

  • Added dedicated Cortex-M compilation path (to_edge_cortex_m) in the AOT compiler with CortexMQuantizer-based quantization
  • Implemented addmm operator support for decomposed linear layers through new _get_addmm_replacement method
  • Enhanced quantization parameter propagation with new PropagateQParamsPass and passthrough op handling in FoldAndAnnotateQParamsPass
  • Extended quantizer to mark parameter nodes as annotated and added passthrough ops (hardtanh, max_pool2d, dropout)

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • examples/arm/aot_arm_compiler.py: Adds the to_edge_cortex_m function for the Cortex-M compilation path using CortexMQuantizer and removes the old transform_for_cortex_m_backend function
  • backends/cortex_m/quantizer/quantizer.py: Adds the _mark_param_node_as_annotated method and extends the passthrough ops list for MobileNetV2 support
  • backends/cortex_m/passes/propagate_qparams_pass.py: New pass to propagate qparams through passthrough ops (transpose/permute) to consumer nodes like addmm
  • backends/cortex_m/passes/cortex_m_pass_manager.py: Adds PropagateQParamsPass and DecomposeAdaptiveAvgPool2dPass to the pass list and adds a skip_passes parameter to __init__
  • backends/cortex_m/passes/convert_to_cortex_m_pass.py: Implements the _get_addmm_replacement method to convert decomposed linear (addmm) operations to cortex_m.quantized_linear
  • backends/arm/_passes/fold_qdq_with_annotated_qparams_pass.py: Adds passthrough ops (hardtanh, relu, clamp) support and second-pass qparams propagation logic


@psiddh psiddh force-pushed the main branch 2 times, most recently from d7d85fb to b222911 Compare February 6, 2026 09:09
Copilot AI review requested due to automatic review settings February 6, 2026 09:09
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.



Comment on lines 358 to 372
    def _mark_param_node_as_annotated(self, node: Node) -> None:
        """
        Mark a weight/bias parameter node as annotated.

        This is necessary for FoldAndAnnotateQParamsPass to recognize the node
        as part of a quantized computation path. The ARM quantizer does this
        via mark_annotated=True in _QuantProperty.
        """
        if Q_ANNOTATION_KEY not in node.meta:
            node.meta[Q_ANNOTATION_KEY] = QuantizationAnnotation()
        node.meta[Q_ANNOTATION_KEY]._annotated = True
        annotation_info = ArmAnnotationInfo(quantized=True)
        meta_custom = node.meta.get("custom", {})
        meta_custom[ArmAnnotationInfo.CUSTOM_META_KEY] = dict(annotation_info)
        node.meta["custom"] = meta_custom

Copilot AI Feb 6, 2026


The implementation of _mark_param_node_as_annotated duplicates the exact logic from mark_node_as_annotated in backends/arm/quantizer/arm_quantizer_utils.py. Consider importing and reusing the existing function instead of duplicating the code to improve maintainability and reduce the risk of divergence.
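For reference, a short sketch of the suggested reuse; the import path is inferred from the file mentioned above and is an assumption:

    # Hypothetical: delegate to the existing ARM helper instead of duplicating it.
    from executorch.backends.arm.quantizer.arm_quantizer_utils import (
        mark_node_as_annotated,
    )

    def _mark_param_node_as_annotated(self, node: Node) -> None:
        """Keep the annotation logic in one place by reusing the shared helper."""
        mark_node_as_annotated(node)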

Collaborator

@AdrianLundell AdrianLundell left a comment


Hi, this PR needs major changes I'm afraid.

  1. The changes to fold_qdq_with_annotated_qparams_pass and propagate_qparams_pass are very likely not needed; rather, they are masking a faulty implementation of either the add_mm or the integration in the aot_arm_compiler.
  2. The addition of the add_mm is a significant change which should be made in a separate PR, properly tested with unittests as is done with all other ops.
  3. It would be great to add mv2 as a pytest similar to mv3; in fact I would suggest starting to get that working before adding support to the aot_arm_compiler, since the compilation pipeline is guaranteed to be working there.

@psiddh
Contributor Author

psiddh commented Feb 6, 2026

Hi, this PR needs major changes I'm afraid.

  1. The changes to fold_qdq_with_annotated_qparams_pass and propagate_qparams_pass are very likely not needed; rather, they are masking a faulty implementation of either the add_mm or the integration in the aot_arm_compiler.
  2. The addition of the add_mm is a significant change which should be made in a separate PR, properly tested with unittests as is done with all other ops.
  3. It would be great to add mv2 as a pytest similar to mv3; in fact I would suggest starting to get that working before adding support to the aot_arm_compiler, since the compilation pipeline is guaranteed to be working there.

Sure - I agree with the approach. I just wanted to share the work I've been up to recently so that we can have exactly this kind of discussion.

Context on the design choice:

The Cortex-M backend keeps addmm directly (vs ARM's decomposition to Conv2D) to leverage CMSIS-NN's optimized linear kernels. This creates a qparam propagation challenge:

When PyTorch decomposes nn.Linear to edge dialect:
linear(input, weight, bias) → addmm(bias, input, weight.T)

The weight flows through a transpose before reaching addmm:
weight → permute_copy → addmm

FoldAndAnnotateQParamsPass folds the DQ into permute, but output_qparams remains empty (no Q node after permute). The addmm node expects weight qparams at input_qparams[2], hence PropagateQParamsPass bridges this gap, as sketched below.
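An illustrative sketch of that bridging, assuming the arm-backend meta keys (input_qparams / output_qparams) and the ExportPass base class; the PR's actual PropagateQParamsPass may differ:

    import torch
    from executorch.exir.pass_base import ExportPass
    from torch.fx.passes.infra.pass_base import PassResult

    # Ops that leave quantization unchanged as data flows through them.
    PASSTHROUGH_OPS = {
        torch.ops.aten.permute_copy.default,
        torch.ops.aten.transpose_copy.int,
    }

    class PropagateQParamsSketch(ExportPass):
        def call(self, graph_module):
            for node in graph_module.graph.nodes:
                if node.op != "call_function" or node.target not in PASSTHROUGH_OPS:
                    continue
                in_qp = node.meta.get("input_qparams")
                # The folded DQ left qparams only on the permute's input side;
                # mirror them to the output side so a consumer such as addmm
                # can read its weight qparams.
                if in_qp and not node.meta.get("output_qparams"):
                    node.meta["output_qparams"] = in_qp
            return PassResult(graph_module, True)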

Proposed approach:

  1. I'll first add test_addmm.py and test_mobilenet_v2.py unit tests following the existing patterns
  2. Once those pass and validate the pipeline, we can review whether PropagateQParamsPass is the right solution or if there's a cleaner approach
  3. The aot_arm_compiler integration can follow in a subsequent PR

This way we have proper test coverage before discussing the implementation details. Let me get the unit tests working first.

@psiddh psiddh marked this pull request as draft February 6, 2026 17:47
@AdrianLundell
Collaborator

Sounds good!

When PyTorch decomposes nn.Linear to edge dialect:
linear(input, weight, bias) → addmm(bias, input, weight.T)

The weight flows through a transpose before reaching addmm:
weight → permute_copy → addmm

I think the issue here is that you are not using the EdgeCompileConfig used in the tester:

    import torch
    from executorch.exir import EdgeCompileConfig

    config = EdgeCompileConfig(
        preserve_ops=[
            torch.ops.aten.linear.default,
            torch.ops.aten.hardsigmoid.default,
            torch.ops.aten.hardsigmoid_.default,
            torch.ops.aten.hardswish.default,
            torch.ops.aten.hardswish_.default,
        ],
        _check_ir_validity=False,
        _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
    )

When linear is not decomposed you avoid the issues around q/dq folding. In general, the design philosophy is that we want the decompositions and annotations to produce correct q/dq values directly, rather than handling special cases in the folding, as that gets complex very quickly in our previous experience with the arm backend.
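For context, a minimal sketch of applying such a config when lowering to edge, assuming the executorch.exir to_edge API:

    from executorch.exir import to_edge

    # `exported_program` is the torch.export.ExportedProgram of the quantized
    # model. With preserve_ops set, aten.linear survives lowering instead of
    # decomposing into permute + addmm, so no q/dq folding bridge is needed.
    edge = to_edge(exported_program, compile_config=config)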

@psiddh psiddh changed the title from "Cortex-M: Enable full MobileNetV2 lowering to CMSIS-NN backend via Aot Compiler script" to "Add Cortex-M as a first-class target in aot_arm_compiler" Feb 19, 2026
@psiddh psiddh marked this pull request as ready for review February 19, 2026 16:55
Copilot AI review requested due to automatic review settings February 19, 2026 16:55
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.



Comment on lines +852 to +859
pass_instances = []
for pass_cls in CortexMPassManager.pass_list:
    sig = inspect.signature(pass_cls.__init__)
    if "exported_program" in sig.parameters:
        pass_instances.append(pass_cls(edge.exported_program()))
    else:
        pass_instances.append(pass_cls())
edge = edge.transform(pass_instances)

Copilot AI Feb 19, 2026


Manual pass instantiation duplicates logic from CortexMPassManager.transform(). The code here manually inspects each pass class and instantiates it based on whether it accepts an exported_program parameter, which duplicates the exact same logic already present in CortexMPassManager.transform(). Consider simplifying this by directly using the CortexMPassManager instead of manually instantiating passes. For example: pass_manager = CortexMPassManager(edge.exported_program()); edge_ep = pass_manager.transform(); edge = EdgeProgramManager({"forward": edge_ep}, ...)

@psiddh
Contributor Author

Already tried the CortexMPassManager approach, and it broke things: CortexMPassManager.transform() returns an ExportedProgram, not an EdgeProgramManager. The edge object was left untransformed, resulting in 351 raw aten ops.

The cleanest approach: add an instantiate_passes method to CortexMPassManager that extracts the inspect logic into a reusable method. Then both transform() and to_edge_cortex_m() can use it without duplication; that can be a follow-up PR. A sketch is below.
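A hypothetical sketch of that refactor (the method name and structure are illustrative, not the PR's final code):

    import inspect

    class CortexMPassManager:
        pass_list = []  # populated with pass classes elsewhere in the backend

        @classmethod
        def instantiate_passes(cls, exported_program):
            """Build pass instances, passing the program only to passes that need it."""
            instances = []
            for pass_cls in cls.pass_list:
                sig = inspect.signature(pass_cls.__init__)
                if "exported_program" in sig.parameters:
                    instances.append(pass_cls(exported_program))
                else:
                    instances.append(pass_cls())
            return instances

    # Both transform() and to_edge_cortex_m() could then share:
    #     passes = CortexMPassManager.instantiate_passes(edge.exported_program())
    #     edge = edge.transform(passes)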


Labels

  • ciflow/trunk
  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • module: microcontrollers (For embedded MCUs like Cortex-M, or RTOS like Zephyr; does not track the NPU backend like Arm Ethos.)
  • partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)


3 participants
